Predicting the COVID‐19 mortality among Iranian patients using tree‐based models: A cross‐sectional study

Abstract Background and Aims To explore the use of different machine learning models in predicting COVID‐19 mortality in hospitalized patients. Materials and Methods A total of 44,112 patients from six academic hospitals who were admitted for COVID‐19 between March 2020 and August 2021 were included in this study. Variables were obtained from their electronic medical records. Random forest‐recursive feature elimination was used to select key features. Decision tree, random forest, LightGBM, and XGBoost models were developed. Sensitivity, specificity, accuracy, F‐1 score, and area under the receiver operating characteristic curve (ROC‐AUC) were used to compare the prediction performance of the models. Results Random forest‐recursive feature elimination selected the following features for inclusion in the prediction model: age, sex, hypertension, malignancy, pneumonia, cardiac problem, cough, dyspnea, and respiratory system disease. XGBoost and LightGBM showed the best performance, with ROC‐AUCs of 0.83 [0.822−0.842] and 0.83 [0.816−0.837], respectively, and a sensitivity of 0.77. Conclusion XGBoost, LightGBM, and random forest show relatively high performance in predicting mortality in COVID‐19 patients and can be applied in hospital settings; however, future research is needed to externally validate these models.

the last 7 days. 3 Mortality has not diminished completely, despite its reduction after the administration of vaccines. In total, 24,863 patients died worldwide from January 9 to 16, 2023. 3 Research suggests that vaccine effectiveness decreases from 80% to 30% after 6 months. 4,5 With fewer people receiving boosters, more people may be at risk of contracting COVID-19, hospitalization, and mortality in the future.
With advances in computational systems during the past two decades, complex statistical approaches have been widely applied in various research fields, such as identifying biomarkers 6,7 for cardiovascular disease 8 and diabetes.
Machine learning (ML) is a branch of artificial intelligence that aims to identify and learn patterns in complex data. ML models are broadly utilized in healthcare to develop prognostic and diagnostic models. 9,10 Demonstrating the appropriateness of ML algorithms plays a critical role in applied healthcare. A large number of ML models have been developed in recent years to guide healthcare professionals.
Such prediction models may simultaneously assist healthcare workers in diagnosing patients and identifying high-risk patients who may need extra care.
Previous studies have utilized different ML models to predict COVID-19 mortality based on various features, including clinical symptoms and demographic information. [11][12][13][14] However, some clinical features require further laboratory testing, which can be time-consuming and hinder the ability to predict mortality upon admission. Additionally, previous studies on predicting outcomes based on comorbidities have been limited by small sample sizes.
Patients' hospital and medical records are regarded as valuable sources for obtaining information about their medical history and comorbidities. This study aims to demonstrate the effectiveness of ML models in predicting COVID-19 mortality based on demographic features and comorbidities, as well as comparing their performance.
To achieve this aim, a data set was collected from the electronic medical records of six academic hospitals, and the recursive feature elimination (RFE) method was used to find relevant features that could contribute to the outcome of COVID-19 patients. After feature selection, the data set was used to build prediction models based on ML algorithms that classify the patients' outcome into death and survival groups. Validation analysis was performed to evaluate the predictive power of each model. This study seeks to demonstrate the ability of several ML models to predict COVID-19 mortality to assist healthcare professionals and to compare their performance.
This study utilized a larger sample size than most relevant literature and assessed various ML algorithms, including novel models such as extreme gradient boosting (XGB) and light gradient boosting machine (LightGBM). Furthermore, classic ML algorithms such as decision tree (DT) and random forest (RF) were used to build the models.

| Variables
The collected data set includes demographic features such as sex and age, symptoms such as cough, fever, chest pain, and dyspnea, and the patients' medical background, including comorbidities such as hypertension, diabetes, obesity, acute renal failure (ARF), chronic kidney disease (CKD), cardiac problems, hepatic failure, malignancy, history of pneumonia, respiratory system disease, and disease of the nervous system (Table 1). All of the variables except age are coded as categorical. The outcome was the death of a patient with COVID-19 during hospitalization.

| Data analysis
The Python programming language was used for preprocessing, feature selection, training, and evaluation of the models, as shown in Figure 1. The Scikit-learn library 15 was used to split the data into training and test sets and to build and evaluate the models. Further, the Pandas, 16 NumPy, 17 and Matplotlib 18 libraries were used for data preprocessing and for plotting graphs.
First, the features observed in less than 0.2% of the patients were eliminated, which resulted in the ARF feature being dropped due to sparsity. The random forest RFE (RF-RFE) method was then used for feature selection. 19 This iterative method trains a model, ranks the features by their importance, and eliminates the feature with the lowest ranking. The process repeats until a specific cutoff or criterion, such as a minimum number of features to select, is reached.
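The sparsity filter and RF-RFE steps described above can be sketched with Scikit-learn as follows. This is a minimal illustration rather than the study's actual code: the synthetic cohort, the column names (`binary_0`, `rare_feature`, and so on), the number of trees, and the random seeds are all assumptions made for the example; only the 0.2% sparsity threshold and the target of nine features come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
n = 2000

# Hypothetical synthetic cohort (not the study data): one continuous
# feature plus binary indicator features.
X = pd.DataFrame({"age": rng.normal(55, 17, n)})
for i in range(11):
    X[f"binary_{i}"] = rng.binomial(1, 0.2, n)
X["rare_feature"] = 0
X.loc[:1, "rare_feature"] = 1  # present in 2 of 2000 rows (0.1%)

# Hypothetical outcome that depends on age and two of the indicators.
logit = 0.05 * (X["age"] - 55) + 1.0 * X["binary_0"] + 0.8 * X["binary_1"] - 2.5
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit.to_numpy())))

# Step 1: drop binary features observed in fewer than 0.2% of patients.
binary_cols = X.columns.drop("age")
prevalence = X[binary_cols].mean()
sparse_cols = prevalence[prevalence < 0.002].index
X = X.drop(columns=sparse_cols)

# Step 2: RF-RFE. Fit a random forest, rank features by importance,
# remove the lowest-ranked feature, and repeat until nine remain.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=9,
    step=1,
)
selector.fit(X, y)
selected = X.columns[selector.support_].tolist()
print("Dropped for sparsity:", list(sparse_cols))
print("Selected features:", selected)
```

Setting `step=1` removes one feature per iteration, which matches the description of eliminating the single lowest-ranked feature at each round.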
The data set was split into training (80%) and test (20%) sets. Because the data are heavily imbalanced, the split was stratified by the death outcome to ensure that both sets contained enough samples from each outcome. In addition, class weights were used during training to reduce the negative effect of the imbalance: each class receives a weight inversely proportional to its frequency, so that observations from the minority class carry more weight. Further, fivefold cross-validation (CV) was used for training and hyperparameter tuning. The training data are split into five folds, and the model is trained on all folds except one, which is used for validation. The final model with tuned hyperparameters was applied to the test set to evaluate its effectiveness. Randomized search CV was employed for hyperparameter tuning of the XGBoost and LightGBM models, and grid search CV was applied for the RF and DT models.
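The splitting, weighting, and tuning procedure above can be sketched as follows for the RF model. This is an illustrative sketch under stated assumptions, not the authors' configuration: the synthetic data, the hyperparameter grid, the use of ROC-AUC as the tuning score, and the use of stratified folds within CV are all assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced data set (about 10% positives), not the study data.
X, y = make_classification(
    n_samples=1000, n_features=9, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# 80/20 split, stratified on the outcome so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Class weights inversely proportional to class frequency:
# weight_c = n_samples / (n_classes * n_samples_in_class_c).
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

# Fivefold CV grid search over an illustrative (assumed) grid.
search = GridSearchCV(
    RandomForestClassifier(class_weight=class_weight, random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

# The refitted best model is what would be evaluated once on the test set.
best_model = search.best_estimator_
print(class_weight, search.best_params_)
```

Scikit-learn's `RandomizedSearchCV` exposes the same `fit` and `best_estimator_` interface and samples a fixed number of candidates from parameter distributions instead of enumerating a grid, which is the kind of search the text describes for XGBoost and LightGBM.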
Four ML models (DT, RF, XGB, and LightGBM) were used to build prediction models for mortality. Model performance was assessed by the area under the receiver operating characteristic (ROC) curve (AUC). Other evaluation metrics, namely accuracy, sensitivity, specificity, and F-1 score, were also computed.
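For reference, the metrics listed above can be derived from the confusion matrix and the predicted probabilities, as in the following sketch. The `evaluate` helper, the 0.5 decision threshold, and the ten toy observations are hypothetical and serve only to illustrate the definitions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def evaluate(y_true, y_prob, threshold=0.5):
    """Return the evaluation metrics for binary outcomes (1 = death)."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)          # recall for the death class
    specificity = tn / (tn + fp)          # recall for the survived class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": float(sensitivity),
        "specificity": float(specificity),
        "accuracy": float((tp + tn) / (tp + tn + fp + fn)),
        "f1": float(f1),
        "roc_auc": float(roc_auc_score(y_true, y_prob)),  # threshold-free
    }


# Toy example: six survivors (0) and four deaths (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.4, 0.2, 0.8, 0.7, 0.4, 0.9]
metrics = evaluate(y_true, y_prob)
print(metrics)
```

In this toy example three of the four deaths are flagged at the 0.5 threshold (sensitivity 0.75) and five of the six survivors are correctly cleared (specificity about 0.83), while the ROC-AUC of 0.9375 is computed from the probabilities and does not depend on the threshold.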
DT is a supervised nonlinear ML algorithm that can be used for classification problems.
RF is an ensemble tree-based method that draws subsets of samples and builds and aggregates multiple DTs to generate a predictive model. 20,21 Boosting is another ensemble method to improve DT predictions.
Boosting uses and aggregates a series of weak learners to build a final prediction model. 20 Table 2.
History of pneumonia was the most prevalent feature among patients who died (N = 1888). In addition, 4194 (10.9%) patients in the survived group had a history of pneumonia.
A significant difference in age was found between the groups.
The median and mean (SD) age were 51 and 52.1 (17.0) years for the survived group, and 69 and 66.9 (15.6) years for the patients who died, respectively.
The density plot (Figure 2) of age also highlights this difference.

FIGURE 2 Density plot of age for different outcomes (bin_outcome = 0 indicates the survived group).

Overall, the prevalence of comorbidities is higher in the death group. However, most symptoms, such as cough, fever, and chest pain, are proportionally more prevalent in the survived group. RF-RFE was applied for feature selection, and the minimum number of features to select was set at nine. The following features were selected for inclusion in the model: age, sex, hypertension, malignancy, pneumonia, cardiac problem, cough, dyspnea, and respiratory system disease.

As shown in
CV was performed to acquire the best hyperparameters for each model. Gini impurity was used as the split criterion in the DT. The maximum depth, the minimum number of samples per leaf, and the minimum number of samples per split were each set to five. As represented in Table 3, the DT displays    As shown in Figure 5, age RF shows a notably better performance than the DT for several reasons. DTs are regarded as non-robust, meaning that even a minor alteration in the data can create a large difference in the final model. Bagging, RFs, and boosting have been proposed to overcome this obstacle.
Bagging uses subsets of samples to build multiple trees and takes a majority vote among the trees to classify each observation.
Because bagging uses all of the available features to build each tree, the trees are highly correlated, which limits how much the variance can be reduced.
RF improves on bagging by using a random subset of features for splitting when building a tree, which de-correlates the trees; averaging over them is therefore more reliable and reduces the variance more than the two previous methods. Boosting is another method to improve DT results.
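The distinction between a single tree, bagging, and RF drawn above can be illustrated with the following sketch. The synthetic data, the number of trees, and the cross-validated ROC-AUC comparison are assumptions for illustration; only the single-tree settings (Gini criterion, with depth and minimum sample parameters of five) follow the text, and the scores obtained here say nothing about the study's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data, not the study data.
X, y = make_classification(
    n_samples=1000, n_features=9, n_informative=5, random_state=0
)

models = {
    # One tree with the DT settings reported in the text.
    "single tree": DecisionTreeClassifier(
        criterion="gini", max_depth=5, min_samples_leaf=5,
        min_samples_split=5, random_state=0,
    ),
    # Bagging: bootstrap samples, every tree may split on all features.
    # The base tree is passed positionally because its keyword name
    # differs between Scikit-learn versions.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=0),
        n_estimators=100, random_state=0,
    ),
    # RF: bootstrap samples plus a random feature subset at each split,
    # which de-correlates the trees.
    "random forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", random_state=0,
    ),
}

# Mean fivefold cross-validated ROC-AUC for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The only structural difference between the bagging and RF configurations is `max_features`: bagged trees consider every feature at each split, whereas RF restricts each split to a random subset.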
Boosting is similar to bagging. However, the trees are built independently in bagging, whereas in boosting the trees are built in sequence, meaning that each tree learns from and corrects the errors of the previously built tree. 20,24 The aforementioned models except DT exhibit relatively high AUC-ROC values of 0.81 and 0.83 using only demographic features, medical history, and comorbidities of the patients. These ML models demonstrated that patients who are more likely to die can be appropriately identified upon their admission to hospital. Based on these models, more than 70% of patients who die can be identified with only a limited set of variables and without conducting a single laboratory test. Therefore, such models are practical for predicting mortality among COVID-19 patients. Predicting the patients' outcome upon admission can be useful in hospital settings, especially in low- and middle-income countries where staff, drug, and bed shortages often occur. High-risk patients can thus be identified and prioritized by applying these ML models.
Several models were previously offered to predict mortality using laboratory results. However, acquiring such models needs

CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The data sets analyzed during the current study are available from the corresponding author upon reasonable request.

ETHICS STATEMENT
This study was approved by the ethics committee of Tehran

TRANSPARENCY DECLARATION
The lead authors (Hojjat Zeraati and Mir Saeed Yekaninejad) affirm that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.